
International Migration, Integration and Social Cohesion online publications

Simplivariate Models: Ideas and First Examples

Author: Age K. Smilde
AK Smilde
BGM Vandeginste
EJ Want
H Turner
HA Chipman
HL Turner
JA Hageman
JC Lindon
Johan A. Westerhuis
Jos A. Hageman
L Lazzeroni
Margriet M. W. B. Hendriks
Mariët J. van der Werf
Mark Isalan
MJ van der Werf
O Fiehn
R Bro
RA van den Berg
Ruud Berger
Publication venue: Public Library of Science
Publication date: 01/01/2008

One of the new expanding areas in functional genomics is metabolomics: measuring the metabolome of an organism. Data being generated in metabolomics studies are very diverse in nature depending on the design underlying the experiment. Traditionally, variation in measurements is conceptually broken down into systematic variation and noise, where the latter contains, e.g., technical variation. There is increasing evidence that this distinction does not hold (or is too simple) for metabolomics data. A more useful distinction is in terms of informative and non-informative variation, where informative relates to the problem being studied. In most common methods for analyzing metabolomics (or any other high-dimensional x-omics) data this distinction is ignored, thereby severely hampering the results of the analysis. This leads to poorly interpretable models and may even obscure the relevant biological information. We developed a framework from first data analysis principles by explicitly formulating the problem of analyzing metabolomics data in terms of informative and non-informative parts. This framework allows for flexible interactions with the biologists involved in formulating prior knowledge of underlying structures. The basic idea is that the informative parts of the complex metabolomics data are approximated by simple components with a biological meaning, e.g., in terms of metabolic pathways or their regulation. Hence, we termed the framework ‘simplivariate models’, which constitutes a new way of looking at metabolomics data. The framework is given in its full generality and exemplified with two methods, IDR analysis and plaid modeling, that fit into the framework. Using this strategy of ‘divide and conquer’, we show that meaningful simplivariate models can be obtained using a real-life microbial metabolomics data set. For instance, one of the simple components contained all the measured intermediates of the Krebs cycle of E. coli. Moreover, these simplivariate models were able to uncover regulatory mechanisms present in the phenylalanine biosynthesis route of E. coli.

Public Library of Science (PLOS)

Wageningen University & Research Publications

International Migration, Integration and Social Cohesion online publications

A structured overview of simultaneous component based data integration

Author: Age K Smilde
AK Smilde
B Escofier
C Lavit
E Segal
EM Qannari
H L'Hermier des Plantes
HA Kiers
HAL Kiers
Henk AL Kiers
HF Kaiser
I Stanimirova
Iven Van Mechelen
J Levin
J Pagès
JA Hageman
K Lemmens
Katrijn Van Deun
M de Tayrac
M Meyners
Mariët J van der Werf
MJ van der Werf
P Robert
R Bro
S de Jong
T Dahl
U Lorenzo-Seva
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Data integration is currently one of the main challenges in the biomedical sciences. Often different pieces of information are gathered on the same set of entities (e.g., tissues, culture samples, biomolecules) with the different pieces stemming, for example, from different measurement techniques. This implies that more and more data appear that consist of two or more data arrays that have a shared mode. An integrative analysis of such coupled data should be based on a simultaneous analysis of all data arrays. In this respect, the family of simultaneous component methods (e.g., SUM-PCA, unrestricted PCovR, MFA, STATIS, and SCA-P) is a natural choice. Yet, different simultaneous component methods may lead to quite different results. Results: We offer a structured overview of simultaneous component methods that frames them in a principal components setting such that both the common core of the methods and the specific elements with regard to which they differ are highlighted. An overview of principles is given that may guide the data analyst in choosing an appropriate simultaneous component method. Several theoretical and practical issues are illustrated with an empirical example on metabolomics data for Escherichia coli as obtained with different analytical chemical measurement methods. Conclusion: Of the aspects in which the simultaneous component methods differ, pre-processing and weighting are consequential. Especially, the type of weighting of the different matrices is essential for simultaneous component analysis. These types are shown to be linked to different specifications of the idea of a fair integration of the different coupled arrays.

University of Groningen

International Migration, Integration and Social Cohesion online publications

Proceedings - University of Groningen

Springer - Publisher Connector

ARTS repository - University of Groningen

Dissertations of the University of Groningen

DISCO-SCA and Properly Applied GSVD as Swinging Methods to Find Common and Distinctive Processes

Author: A Subramanian
A Tanay
Age K. Smilde
AK Smilde
Anna Tramontano
Bart De Moor
C Hennig
CC Paige
CF Van Loan
HAL Kiers
Henk A. L. Kiers
I Måge
IT Jolliffe
Iven Van Mechelen
J Ihmels
J Westerhuis
JA Hageman
JM Stuart
K Devarajan
K Lemmens
K Van Deun
KA Bernstein
Katrijn Van Deun
Lieven De Lathauwer
Lieven Thorrez
M Schouteden
Mariët J. van der Werf
Martijn Schouteden
ME Timmerman
MJ van der Werf
MW Browne
NS Holter
O Alter
P Howland
P Tamayo
RA van den Berg
S Bergmann
S Friedland
SP Ponnapalli
T Dahl
T Löfstedt
U Lorenzo-Seva
VK Mootha
Z Bai
Publication venue: Public Library of Science
Publication date: 01/01/2012

BACKGROUND: In systems biology it is common to obtain for the same set of biological entities information from multiple sources. Examples include expression data for the same set of orthologous genes screened in different organisms and data on the same set of culture samples obtained with different high-throughput techniques. A major challenge is to find the important biological processes underlying the data and to disentangle therein processes common to all data sources and processes distinctive for a specific source. Recently, two promising simultaneous data integration methods have been proposed to attain this goal, namely generalized singular value decomposition (GSVD) and simultaneous component analysis with rotation to common and distinctive components (DISCO-SCA). RESULTS: Both theoretical analyses and applications to biologically relevant data show that: (1) straightforward applications of GSVD yield unsatisfactory results, (2) DISCO-SCA performs well, (3) provided proper pre-processing and algorithmic adaptations, GSVD reaches a performance level similar to that of DISCO-SCA, and (4) DISCO-SCA is directly generalizable to more than two data sources. The biological relevance of DISCO-SCA is illustrated with two applications. First, in a setting of comparative genomics, it is shown that DISCO-SCA recovers a common theme of cell cycle progression and a yeast-specific response to pheromones. The biological annotation was obtained by applying Gene Set Enrichment Analysis in an appropriate way. Second, in an application of DISCO-SCA to metabolomics data for Escherichia coli obtained with two different chemical analysis platforms, it is illustrated that the metabolites involved in some of the biological processes underlying the data are detected by one of the two platforms only; therefore, platforms for microbial metabolomics should be tailored to the biological question. 
CONCLUSIONS: Both DISCO-SCA and properly applied GSVD are promising integrative methods for finding common and distinctive processes in multisource data. Open source code for both methods is provided.

University of Groningen

International Migration, Integration and Social Cohesion online publications

FigShare

Public Library of Science (PLOS)

Proceedings - University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Simplivariate Models: Uncovering the Underlying Biology in Functional Genomics Data

Author: A Buko
A Gordon
A Raftery
A Smilde
Age K. Smilde
Arkady B. Khodursky
B Selman
C Lucasius
C Merlin
C Sands
D Anderson
D Bueschkens
D Camacho
D Livingstone
D Witten
E Wood
Edoardo Saccenti
G Golub
H Chipman
H Turner
I Jolliffe
I Keseler
J Hageman
J Heijenoort
J Nicholson
J Schott
J Topliss
JA Hageman
Johan A. Westerhuis
Jos A. Hageman
K De Jong
K Van Deun
L Breiman
L Kaufman
L Lazzeroni
M Assfalg
Margriet M. W. B. Hendriks
Mariët J. van der Werf
P Bernini
P Hallgren
P Harrington
R Steuer
R Tibshirani
R van den Berg
R Wildman
S Salvador
T Knijnenburg
T Watanabe
X Gao
Y Wang
Publication venue: Public Library of Science
Publication date: 01/01/2010

One of the first steps in analyzing high-dimensional functional genomics data is an exploratory analysis of such data. Cluster Analysis and Principal Component Analysis are then usually the methods of choice. Despite their versatility, they also have a severe drawback: they do not always generate simple and interpretable solutions. On the basis of the observation that functional genomics data often contain both informative and non-informative variation, we propose a method that finds sets of variables containing informative variation. This informative variation is subsequently expressed in easily interpretable simplivariate components.

CiteSeerX

Public Library of Science (PLOS)